Modernize code generation for the external LLVM 22 back-end by maleadt · Pull Request #3169 · JuliaGPU/CUDA.jl

maleadt · 2026-06-06T14:56:39Z

Machine code generation goes through an external LLVM 22 llc now, with the in-process LLVM only driving the middle end. That makes a bunch of old workarounds unnecessary, and unlocks some functionality:

- `llvm_compat` reports the capabilities of `NVPTX_LLVM_Backend_jll` instead of the in-process LLVM, so PTX target selection isn't held back by the Julia-bundled version anymore. This adds `sm_88` and `sm_110` support.
- `llvm.`-prefixed declarations that the in-process LLVM doesn't recognize no longer trigger device-runtime linking; the back-end lowers them.
- Dropped the fast min/max workaround for #2886 ("LLVM 18- generates non-existing min.NaN.f64/max.NaN.f64 instructions"), which is fixed in LLVM 21+.
- Float16 atomic addition uses `atomicrmw fadd` instead of inline assembly, and BFloat16 gets native atomic add/sub on Julia 1.11+ (with a CAS fallback below `sm_70` and `sm_90`, respectively).
- `active_mask()` calls `llvm.nvvm.activemask` on LLVM 20+. The inline-assembly fallback for older versions is now marked side-effecting, since it could previously be hoisted or merged across divergent control flow.
- Fast Float32 `exp2` uses the `ex2.approx` intrinsic.

One caveat: llc recomputes the data layout from the triple, ignoring the module's, so 128-bit integers are always 16-byte aligned on the device. Julia only aligns them that way since 1.12, meaning aggregates with (U)Int128 fields may lay out differently on older hosts. Kernel arguments with such layout mismatches are now rejected with an error pointing at Julia 1.12.
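To make the mismatch concrete, here is a minimal Python sketch of C-style struct layout. It assumes natural alignment for every field and 8-byte alignment for 128-bit integers on pre-1.12 hosts (the figure given in the commit message). The `struct_layout` helper and the two-field aggregate are hypothetical and are not part of CUDA.jl.

```python
def struct_layout(fields, int128_align):
    """Return (offsets, size) for a C-style struct.

    `fields` is a list of (name, size_in_bytes) pairs. Every field is assumed
    to be naturally aligned, except 16-byte integers, which use `int128_align`.
    """
    offsets = {}
    offset = 0
    struct_align = 1
    for name, size in fields:
        align = int128_align if size == 16 else size
        # Round the running offset up to the field's alignment.
        offset = (offset + align - 1) // align * align
        offsets[name] = offset
        offset += size
        struct_align = max(struct_align, align)
    # Pad the tail so that arrays of the struct keep every element aligned.
    size = (offset + struct_align - 1) // struct_align * struct_align
    return offsets, size


# Hypothetical aggregate with an Int64 field `a` followed by an Int128 field `b`.
fields = [("a", 8), ("b", 16)]

host_offsets, host_size = struct_layout(fields, int128_align=8)       # Julia < 1.12 host
device_offsets, device_size = struct_layout(fields, int128_align=16)  # external back-end

print("host:  ", host_offsets, host_size)
print("device:", device_offsets, device_size)
```

Under these assumptions the host places `b` at offset 8 in a 24-byte struct, while the device expects it at offset 16 in a 32-byte struct, so a kernel reading `b` from a host-laid-out argument would see the wrong bytes.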

Also includes a test guarding against dynamically-indexed aggregate arguments being copied to local memory (the regression fixed by llvm/llvm-project#201772), and updates the fdiv/rcp PTX tests for the new back-end's lowering (inv now selects rcp instructions, and fast Float64 division gets Newton refinement).

Machine code is generated by an external, up-to-date LLVM, so target selection should not be limited by the in-process LLVM version (which only drives the middle end, and is not configured for any particular device). This makes the back-end compile natively for recent devices, e.g., sm_120a instead of sm_90 with a rewritten PTX header on Blackwell, and unlocks newer PTX ISAs.

Intrinsics unknown to the in-process LLVM, e.g. selected by libdevice's __CUDA_ARCH dispatch, were counted as undefined functions, needlessly compiling for relocatable code and linking against cudadevrt.

Old LLVM back-ends generated nonexistent min.NaN/max.NaN instructions for fast fp64 min/max and fp16 minimum/maximum (#2886); the external back-end lowers these correctly for every subtarget.

Plain atomicrmw fadd gives LLVM real atomic semantics instead of an opaque asm blob, generating the same instructions while remaining optimizable; the back-end also expands it on devices without native support. BFloat16 atomic addition is new (sm_90 hardware, expanded elsewhere), and requires Julia 1.11 for bfloat codegen support.
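The expansion on devices without native support can be pictured as a compare-and-swap retry loop. The following Python sketch simulates that idea on a 16-bit word. The `Cell16` class and all helper names are hypothetical; the lock only stands in for a hardware CAS, and the sketch ignores how the real back-end picks the CAS width or memory ordering.

```python
import struct
import threading


class Cell16:
    """A 16-bit memory word with a simulated atomic compare-and-swap."""

    def __init__(self, bits=0):
        self._bits = bits
        self._lock = threading.Lock()

    def load(self):
        return self._bits

    def compare_and_swap(self, expected, desired):
        # Returns the value observed in memory; the swap happened
        # if and only if that value equals `expected`.
        with self._lock:
            observed = self._bits
            if observed == expected:
                self._bits = desired
            return observed


def f16_to_bits(x):
    return struct.unpack("<H", struct.pack("<e", x))[0]


def bits_to_f16(bits):
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def atomic_fadd_f16(cell, value):
    """Add `value` to the half-precision float in `cell`; return the old value."""
    old = cell.load()
    while True:
        new = f16_to_bits(bits_to_f16(old) + value)
        observed = cell.compare_and_swap(old, new)
        if observed == old:
            return bits_to_f16(old)
        old = observed  # another thread updated the word first; retry


def worker(cell, count):
    for _ in range(count):
        atomic_fadd_f16(cell, 1.0)


cell = Cell16(f16_to_bits(0.0))
threads = [threading.Thread(target=worker, args=(cell, 100)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = bits_to_f16(cell.load())
print(total)
```

Four threads each add 1.0 a hundred times, and no update is lost because a failed swap retries with the freshly observed value.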

The intrinsic has no side effects, unlike the inline assembly it replaces, so it can be CSE'd, hoisted, and constant-folded.

The inline assembly lacked the side-effect flag, allowing LLVM to merge or hoist it across divergent control flow. Use the convergent intrinsic where available (LLVM 20+), and mark the assembly side-effecting on older versions.

The back-end aligns 128-bit integers to 16 bytes, but Julia versions before 1.12 align them to 8, so aggregates with (U)Int128 fields can lay out differently on host and device. These used to be compiled quietly, reading garbage on the device; error instead.

github-actions

CUDA.jl Benchmarks

Details

Benchmark suite	Current: `356c85b`	Previous: `112549e`	Ratio
`array/accumulate/Float32/1d`	`100163` ns	`99517` ns	`1.01`
`array/accumulate/Float32/dims=1`	`75240` ns	`75910` ns	`0.99`
`array/accumulate/Float32/dims=1L`	`1628693` ns	`1594980` ns	`1.02`
`array/accumulate/Float32/dims=2`	`140970` ns	`141259` ns	`1.00`
`array/accumulate/Float32/dims=2L`	`652567` ns	`653724` ns	`1.00`
`array/accumulate/Int64/1d`	`118755` ns	`118852` ns	`1.00`
`array/accumulate/Int64/dims=1`	`79140` ns	`79413` ns	`1.00`
`array/accumulate/Int64/dims=1L`	`1723746` ns	`1709492` ns	`1.01`
`array/accumulate/Int64/dims=2`	`153114` ns	`154250` ns	`0.99`
`array/accumulate/Int64/dims=2L`	`960242` ns	`960390` ns	`1.00`
`array/broadcast`	`18384` ns	`18461` ns	`1.00`
`array/construct`	`1197.5` ns	`1193` ns	`1.00`
`array/copy`	`16621` ns	`16550` ns	`1.00`
`array/copyto!/cpu_to_gpu`	`213583` ns	`214764` ns	`0.99`
`array/copyto!/gpu_to_cpu`	`278812` ns	`280613` ns	`0.99`
`array/copyto!/gpu_to_gpu`	`10254` ns	`10344` ns	`0.99`
`array/iteration/findall/bool`	`133353` ns	`134100` ns	`0.99`
`array/iteration/findall/int`	`147614` ns	`147421` ns	`1.00`
`array/iteration/findfirst/bool`	`69959` ns	`112673` ns	`0.62`
`array/iteration/findfirst/int`	`71112` ns	`112820` ns	`0.63`
`array/iteration/findmin/1d`	`67998` ns	`67036` ns	`1.01`
`array/iteration/findmin/2d`	`101335` ns	`100960` ns	`1.00`
`array/iteration/logical`	`193754` ns	`193400` ns	`1.00`
`array/iteration/scalar`	`65567` ns	`64965` ns	`1.01`
`array/permutedims/2d`	`49616` ns	`49581` ns	`1.00`
`array/permutedims/3d`	`50731` ns	`50662` ns	`1.00`
`array/permutedims/4d`	`50885` ns	`50962` ns	`1.00`
`array/random/rand/Float32`	`11550` ns	`12069` ns	`0.96`
`array/random/rand/Int64`	`22788` ns	`24024` ns	`0.95`
`array/random/rand!/Float32`	`7837.333333333333` ns	`8798.666666666666` ns	`0.89`
`array/random/rand!/Int64`	`17838` ns	`20664` ns	`0.86`
`array/random/randn/Float32`	`35484` ns	`35378` ns	`1.00`
`array/random/randn!/Float32`	`23789` ns	`23654` ns	`1.01`
`array/reductions/mapreduce/Float32/1d`	`33624` ns	`33516` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1`	`38432` ns	`38509` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1L`	`50250` ns	`50248` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2`	`56205` ns	`55822` ns	`1.01`
`array/reductions/mapreduce/Float32/dims=2L`	`67291` ns	`67519` ns	`1.00`
`array/reductions/mapreduce/Int64/1d`	`39237` ns	`39436` ns	`0.99`
`array/reductions/mapreduce/Int64/dims=1`	`41561` ns	`41192` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=1L`	`86410` ns	`86477` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`58727` ns	`57772` ns	`1.02`
`array/reductions/mapreduce/Int64/dims=2L`	`82371` ns	`83119` ns	`0.99`
`array/reductions/reduce/Float32/1d`	`33454` ns	`33724` ns	`0.99`
`array/reductions/reduce/Float32/dims=1`	`38555` ns	`38486` ns	`1.00`
`array/reductions/reduce/Float32/dims=1L`	`50180` ns	`50211` ns	`1.00`
`array/reductions/reduce/Float32/dims=2`	`56128` ns	`55852` ns	`1.00`
`array/reductions/reduce/Float32/dims=2L`	`66986` ns	`69022` ns	`0.97`
`array/reductions/reduce/Int64/1d`	`39115` ns	`39412` ns	`0.99`
`array/reductions/reduce/Int64/dims=1`	`41248` ns	`40972` ns	`1.01`
`array/reductions/reduce/Int64/dims=1L`	`86521` ns	`86447` ns	`1.00`
`array/reductions/reduce/Int64/dims=2`	`58247` ns	`57742` ns	`1.01`
`array/reductions/reduce/Int64/dims=2L`	`83524` ns	`82671` ns	`1.01`
`array/reverse/1d`	`16824` ns	`16903` ns	`1.00`
`array/reverse/1dL`	`67720` ns	`67929` ns	`1.00`
`array/reverse/1dL_inplace`	`65187` ns	`65328` ns	`1.00`
`array/reverse/1d_inplace`	`8317.666666666666` ns	`10020.333333333334` ns	`0.83`
`array/reverse/2d`	`20065` ns	`20099` ns	`1.00`
`array/reverse/2dL`	`71781` ns	`71890` ns	`1.00`
`array/reverse/2dL_inplace`	`64950` ns	`65089` ns	`1.00`
`array/reverse/2d_inplace`	`9543` ns	`9724` ns	`0.98`
`array/sorting/1d`	`2654742` ns	`2658878` ns	`1.00`
`array/sorting/2d`	`1033402` ns	`1040327` ns	`0.99`
`array/sorting/by`	`3180132` ns	`3193494` ns	`1.00`
`cuda/synchronization/context/auto`	`1158.7` ns	`1122.1` ns	`1.03`
`cuda/synchronization/context/blocking`	`931.6296296296297` ns	`908.9714285714285` ns	`1.02`
`cuda/synchronization/context/nonblocking`	`6052.6` ns	`6022.8` ns	`1.00`
`cuda/synchronization/stream/auto`	`989.4` ns	`993.9` ns	`1.00`
`cuda/synchronization/stream/blocking`	`837.8115942028985` ns	`827.3783783783783` ns	`1.01`
`cuda/synchronization/stream/nonblocking`	`5901.6` ns	`5915` ns	`1.00`
`integration/byval/reference`	`143146` ns	`142979` ns	`1.00`
`integration/byval/slices=1`	`145285` ns	`145133` ns	`1.00`
`integration/byval/slices=2`	`283812` ns	`283763` ns	`1.00`
`integration/byval/slices=3`	`422279` ns	`422104` ns	`1.00`
`integration/cudadevrt`	`101563` ns	`101484` ns	`1.00`
`integration/volumerhs`	`8997466` ns	`9077118` ns	`0.99`
`kernel/indexing`	`12705` ns	`12734` ns	`1.00`
`kernel/indexing_checked`	`13427` ns	`13463` ns	`1.00`
`kernel/launch`	`2058` ns	`2146.3333333333335` ns	`0.96`
`kernel/occupancy`	`724.4328358208955` ns	`688.7569444444445` ns	`1.05`
`kernel/rand`	`15233` ns	`14254` ns	`1.07`
`latency/import`	`3850082067` ns	`3847133996` ns	`1.00`
`latency/precompile`	`4630800385` ns	`4625229019` ns	`1.00`
`latency/ttfp`	`4521745566` ns	`4496455873` ns	`1.01`

This comment was automatically generated by workflow using github-action-benchmark.
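For reading the table: the Ratio column appears to be the current time divided by the previous time, so values below 1 mean the current commit is faster. A small Python sketch using four rows from the table reproduces the reported ratios. The 5% threshold and the `classify` helper are hypothetical and are not taken from the benchmark workflow.

```python
# (benchmark, current ns, previous ns), copied from the table above
rows = [
    ("array/iteration/findfirst/bool", 69959, 112673),
    ("array/reverse/1d_inplace", 8317.666666666666, 10020.333333333334),
    ("kernel/rand", 15233, 14254),
    ("array/broadcast", 18384, 18461),
]

THRESHOLD = 0.05  # hypothetical cut-off for calling a change significant


def ratio(current_ns, previous_ns):
    """Current over previous, rounded to two decimals as in the table."""
    return round(current_ns / previous_ns, 2)


def classify(current_ns, previous_ns):
    r = current_ns / previous_ns
    if r > 1 + THRESHOLD:
        return "slower"
    if r < 1 - THRESHOLD:
        return "faster"
    return "unchanged"


for name, current, previous in rows:
    print(name, ratio(current, previous), classify(current, previous))
```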

The external back-end selects fast minnum/minimum to single min/max instructions instead of compare + select, picking the NaN-propagating variants where available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Julia's floating-point min/max follow IEEE 754-2019 minimum/maximum semantics, which map directly onto these intrinsics. The external back-end legalizes them on every subtarget (min.NaN/max.NaN on sm_80+, a NaN/signed-zero-correct expansion elsewhere), so drop the manual emulation based on __nv_fmin plus a NaN fix-up. That emulation also inherited llvm.minnum's loose signed-zero semantics, causing constant folding to break the -0.0 < +0.0 ordering on device.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
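The difference between the two families of semantics can be sketched in plain Python. The functions below are illustrative models only, not the LLVM or CUDA.jl implementations: `minnum` treats a NaN as missing data and does not order signed zeros (this sketch returns the first operand on ties), while `minimum` propagates NaN and orders -0.0 below +0.0.

```python
import math


def minnum(a, b):
    """minNum-style minimum: a NaN operand is ignored in favor of the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    # Signed zeros compare equal, so the zero that comes back is not pinned
    # down; this model returns the first operand on ties.
    return b if b < a else a


def minimum(a, b):
    """IEEE 754-2019 minimum: NaN propagates and -0.0 orders below +0.0."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if a == 0.0 and b == 0.0:
        # Return the negative zero if either operand carries the sign bit.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b


print(minnum(math.nan, 1.0), minimum(math.nan, 1.0))
print(minnum(0.0, -0.0), minimum(0.0, -0.0))
```

In this model `minnum(0.0, -0.0)` yields +0.0 whereas `minimum(0.0, -0.0)` yields -0.0, which is the kind of signed-zero discrepancy the commit message describes.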

device_layout called sizeof on every zero-field DataType, but types like Symbol don't have a definite size. Non-isbits arguments are passed by reference, so their layout is Julia's business on both sides; treat them as opaque. Only affected Julia 1.10/1.11, where the layout check is active. Also add tests for the Int128 layout rejection itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Now that targets are selected based on the back-end LLVM, recent devices compile natively (e.g. sm_120a) rather than for an older baseline. Adjust the feature-set expectation to consult the back-end version, and accept the wider vector accesses (v2.b64) such targets prefer over v4.b32.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-07T12:09:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (aa47d7a) to head (356c85b).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3169      +/-   ##
==========================================
- Coverage   16.33%   16.32%   -0.02%     
==========================================
  Files         124      124              
  Lines        9875     9875              
==========================================
- Hits         1613     1612       -1     
- Misses       8262     8263       +1

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maleadt added 9 commits June 5, 2026 20:31

Don't link the device runtime for back-end intrinsics.

0c67754

Drop the fast min/max workaround.

d70b655

Use the ex2.approx intrinsic for fast Float32 exp2.

c151de6

Fix hoistable active_mask.

2e134a3

Test that aggregate arguments stay out of local memory.

01a4bc8

Update test.

74d8569

maleadt changed the title ~~Modernize for LLVM 22~~ Modernize code generation for the external LLVM 22 back-end Jun 6, 2026

github-actions Bot reviewed Jun 6, 2026

maleadt and others added 4 commits June 6, 2026 18:02

maleadt merged commit 1caa3ad into main Jun 8, 2026
2 checks passed

maleadt deleted the tb/ptx_llvm22 branch June 8, 2026 04:51

maleadt added a commit to JohnCobbler/CUDA.jl that referenced this pull request Jun 11, 2026

Merge main to pick up backend test fixes (JuliaGPU#3169)

3c55b6c

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maleadt merged 13 commits into main from tb/ptx_llvm22
